2022-10-27

(PART) Machine Learning

Introduction to machine learning

  • Perhaps the most popular data science methodologies come from the field of machine learning.

  • Machine learning success stories include the handwritten zip code readers implemented by the postal service, speech recognition technology such as Apple’s Siri, movie recommendation systems, spam and malware detectors, housing price predictors, and driverless cars.

Introduction to machine learning

  • Although today Artificial Intelligence and machine learning are often used interchangeably, we make the following distinction: while the first artificial intelligence algorithms, such as those used by chess playing machines, implemented decision making based on programmable rules derived from theory or first principles, in machine learning decisions are based on algorithms built with data.

Notation

  • In machine learning, data comes in the form of:

    1. the outcome we want to predict and.

    2. the features that we will use to predict the outcome.

  • We want to build an algorithm that takes feature values as input and returns a prediction for the outcome when we don’t know the outcome.

  • The machine learning approach is to train an algorithm using a dataset for which we do know the outcome, and then apply this algorithm in the future to make a prediction when we don’t know the outcome.

Notation

  • Here we will use \(Y\) to denote the outcome and \(X_1, \dots, X_p\) to denote features.

  • Note that features are sometimes referred to as predictors or covariates.

  • We consider all these to be synonyms.

  • Prediction problems can be divided into categorical and continuous outcomes.

  • For categorical outcomes, \(Y\) can be any one of \(K\) classes.

  • The number of classes can vary greatly across applications.

Notation

  • For example, in the digit reader data, \(K=10\) with the classes being the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.

  • In speech recognition, the outcomes are all possible words or phrases we are trying to detect.

  • Spam detection has two outcomes: spam or not spam.

  • In this book, we denote the \(K\) categories with indexes \(k=1,\dots,K\).

  • However, for binary data we will use \(k=0,1\) for mathematical conveniences that we demonstrate later.

Notation

  • The general setup is as follows.

  • We have a series of features and an unknown outcome we want to predict:

outcome feature 1 feature 2 feature 3 feature 4 feature 5
? \(X_1\) \(X_2\) \(X_3\) \(X_4\) \(X_5\)

Notation

  • To build a model that provides a prediction for any set of observed values \(X_1=x_1, X_2=x_2, \dots X_5=x_5\), we collect data for which we know the outcome:
outcome feature 1 feature 2 feature 3 feature 4 feature 5
\(y_{1}\) \(x_{1,1}\) \(x_{1,2}\) \(x_{1,3}\) \(x_{1,4}\) \(x_{1,5}\)
\(y_{2}\) \(x_{2,1}\) \(x_{2,2}\) \(x_{2,3}\) \(x_{2,4}\) \(x_{2,5}\)
\(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
\(y_n\) \(x_{n,1}\) \(x_{n,2}\) \(x_{n,3}\) \(x_{n,4}\) \(x_{n,5}\)

Notation

  • When the output is continuous we refer to the machine learning task as prediction, and the main output of the model is a function \(f\) that automatically produces a prediction, denoted with \(\hat{y}\), for any set of predictors: \(\hat{y} = f(x_1, x_2, \dots, x_p)\).

  • We use the term actual outcome to denote what we ended up observing.

  • So we want the prediction \(\hat{y}\) to match the actual outcome \(y\) as well as possible.

Notation

  • Because our outcome is continuous, our predictions \(\hat{y}\) will not be either exactly right or wrong, but instead we will determine an error defined as the difference between the prediction and the actual outcome \(y - \hat{y}\).

  • When the outcome is categorical, we refer to the machine learning task as classification, and the main output of the model will be a decision rule which prescribes which of the \(K\) classes we should predict.

  • In this scenario, most models provide functions of the predictors for each class \(k\), \(f_k(x_1, x_2, \dots, x_p)\), that are used to make this decision.

Notation

  • When the data is binary a typical decision rules looks like this: if \(f_1(x_1, x_2, \dots, x_p) > C\), predict category 1, if not the other category, with \(C\) a predetermined cutoff.

  • Because the outcomes are categorical, our predictions will be either right or wrong.

  • Notice that these terms vary among courses, text books, and other publications.

  • Often prediction is used for both categorical and continuous outcomes, and the term regression can be used for the continuous case.

Notation

  • Here we avoid using regression to avoid confusion with our previous use of the term linear regression.

  • In most cases it will be clear if our outcomes are categorical or continuous, so we will avoid using these terms when possible.

An example

  • Let’s consider the zip code reader example.

  • The first step in handling mail received in the post office is sorting letters by zip code:

  • Originally, humans had to sort these by hand.

  • To do this, they had to read the zip codes on each letter.

  • Today, thanks to machine learning algorithms, a computer can read zip codes and then a robot sorts the letters.

An example

  • In this part of the book, we will learn how to build algorithms that can read a digit.

  • The first step in building an algorithm is to understand.

  • what are the outcomes and features.

  • Below are three images of written digits.

  • These have already been read by a human and assigned an outcome \(Y\).

  • These are considered known and serve as the training set.

An example

An example

  • The images are converted into \(28 \times 28 = 784\) pixels and, for each pixel, we obtain a grey scale intensity between 0 (white) and 255 (black), which we consider continuous for now.

  • The following plot shows the individual features for each image:

An example

An example

  • For each digitized image \(i\), we have a categorical outcome \(Y_i\) which can be one of 10 values (\(0,1,2,3,4,5,6,7,8,9\)), and features \(X_{i,1}, \dots, X_{i,784}\).

  • We use bold face \(\mathbf{X}_i = (X_{i,1}, \dots, X_{i,784})\) to distinguish the vector of predictors from the individual predictors.

  • When referring to an arbitrary set of features rather than a specific image in our dataset, we drop the index \(i\) and use \(Y\) and \(\mathbf{X} = (X_{1}, \dots, X_{784})\).

  • We use upper case variables because, in general, we think of the predictors as random variables.

An example

  • We use lower case, for example \(\mathbf{X} = \mathbf{x}\), to denote observed values.

  • When we code we stick to lower case.

  • The machine learning task is to build an algorithm that returns a prediction for any of the possible values of the features.

  • Here, we will learn several approaches to building these algorithms.

An example

  • Although at this point it might seem impossible to achieve this, we will start with simple examples and build up our knowledge until we can attack more complex ones.

  • In fact, we start with an artificially simple example with just one predictor and then move on to a slightly more realistic example with two predictors.

  • Once we understand these, we will attack real-world machine learning challenges involving many predictors.